Skip to content

fix: round-3 retrospective mechanical fixes (#457, #455, #459)#462

Merged
alexandru-savinov merged 4 commits into
mainfrom
fix/round3-retrospective
Jun 3, 2026
Merged

fix: round-3 retrospective mechanical fixes (#457, #455, #459)#462
alexandru-savinov merged 4 commits into
mainfrom
fix/round3-retrospective

Conversation

@alexandru-savinov

Copy link
Copy Markdown
Owner

Round-3 retrospective mechanical fixes

Four small, independently-verified fixes surfaced by the commit-history retrospective (epic #454, round 3). Each is one commit, scoped to one issue.

Commit Issue What
fix(zdr) #457 Open-WebUI ZDR pipe reads canonical model_id/id first (rsplit only as last-resort fallback) — it was deriving the model id from name.rsplit(" | "), leaking display fragments like Qwen3 Coder instead of qwen/qwen3-coder:free. Same bug the proxy fixed in #421, never propagated to the pipe. Stub updated to the real upstream shape + regression test added.
fix(reliability) #455 Add TimeoutStartSec to non-serve oneshots with wait-loops so they aren't SIGTERM'd at the 90s host default: restic-check (0/unbounded — a killed check also fires a false backup-failure-alert), restic-backups (4h), n8n-community-packages (200), and the three Open-WebUI DB-migration oneshots (300).
fix(reliability) partially #451 zero-kuzea (4GB CX22 GRUB) gets a 2GB swapfile; --max-jobs 1 --cores 1 throttle added to install.sh + deploy.sh build/switch so a small-RAM VPS build can't OOM.
fix(docs) #459 Correct the CLAUDE.md Disk-Offload section: the assembly node must stream output to disk (not read everything back into memory, which OOM'd after #438). Cites e098d06.

Validation

Run on a darwin host (no aarch64 builder), so validated by evaluation:

CI's Trivy secret scan + aarch64 eval are the canonical gates. Based on top of #461.

Notes

🤖 Generated with Claude Code

alexandru-savinov and others added 4 commits June 3, 2026 20:51
The /endpoints/zdr model-list builder derived ids from name.rsplit(" | "),
which leaks the human-readable display fragment (e.g. "Qwen3 Coder") into
model ids for providers whose name right-half is the model_name rather than
the slug. Read the canonical model_id (then id) first, keeping the
name-rsplit only as a last-resort fallback — mirrors openclaw-zdr-proxy
_parse_zdr_response (PR #421). Updated the test fixture to the real upstream
shape and added a regression assertion that no display name leaks into an id.

Note: nix fmt skipped — formatter (nixpkgs-fmt) is only defined for Linux
systems and this is an aarch64-darwin host; no .nix files changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n-webui oneshots (#455)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uilds (#451)

Add a 2GB /swapfile to the 4GB zero-kuzea CX22 host (mirroring sancta-choir/
sancta-claw) and pass --max-jobs 1 --cores 1 to the nixos-rebuild build/switch
invocations in scripts/install.sh and scripts/deploy.sh so the build step
cannot OOM a small-RAM VPS.

nix fmt N/A on this aarch64-darwin host (formatter only defined for Linux).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ut (#459)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexandru-savinov alexandru-savinov added claude Created/proposed by Claude Code reliability Reliability improvements labels Jun 3, 2026
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Code Review

Summary Table

Severity File Issue
MEDIUM modules/services/backup-pull.nix TimeoutStartSec = 0 disables timeout entirely on restic-check — service can hang indefinitely
MEDIUM modules/services/n8n.nix TimeoutStartSec = 200 with ~180s wait-loops leaves only 20s margin; variance could cause spurious timeouts
LOW scripts/deploy.sh, scripts/install.sh --max-jobs 1 --cores 1 significantly slows builds on multi-core machines
LOW hosts/zero-kuzea/configuration.nix Swapfile on SSD-backed VPS increases write wear; acceptable for low-write swap usage

modules/services/backup-pull.nix

TimeoutStartSec = 0 (restic-check service) — MEDIUM

Setting TimeoutStartSec = 0 disables the start timeout entirely. The comment correctly identifies the problem (false backup-failure-alert when the check is killed mid-run), but the fix trades a false-alarm risk for an unbounded hang risk. If restic check stalls due to a network issue, corrupted repo metadata, or a lock left by a previous run, the unit will never exit, blocking any After= dependents and accumulating zombie workers if timers keep firing.

A tighter ceiling (e.g. TimeoutStartSec = "12h") would still be far above any realistic check duration while preserving liveness guarantees.

TimeoutStartSec = "4h" (restic-backup service) — OK

Reasonable for large backup runs. No issues.


modules/services/n8n.nix

TimeoutStartSec = 200 — MEDIUM

Comment says "~180s of internal wait-loops". 200 - 180 = 20s margin. Any extra retry, slow n8n startup, or load spike on the RPi5 could push past 200s and trigger a SIGTERM, which is the exact failure mode being fixed. 300 or 600 would match the Open-WebUI oneshots in this PR and give safer headroom.


modules/services/open-webui.nix

Three oneshots set to TimeoutStartSec = 300. Comments say ~90s of wait-loops; 300s is 3x headroom. No issues.


modules/services/open-webui-functions/openrouter_zdr_pipe.py

Correct fix. Reading model_id -> id -> name.rsplit fallback chain is the right approach. The added isinstance(item, dict) guard is a safe defensive check. Regression test properly exercises the canonical-field path and verifies display fragments do not leak into model IDs. No issues.


hosts/zero-kuzea/configuration.nix

2GB swapfile on a CX22 (40GB SSD). size = 2048 is MiB, so correctly 2GB. NixOS swapDevices with device = "/swapfile" will create and activate it. Appropriate OOM headroom for a 4GB RAM VPS. No issues.


scripts/deploy.sh + scripts/install.sh

--max-jobs 1 --cores 1 prevents parallel builds from OOM-ing on small-RAM machines. Tradeoff: build times will be significantly longer on multi-core hosts. Defensible for install scripts targeting memory-constrained targets. Acceptable as-is.


CLAUDE.md

Docs-only update clarifying the streaming disk-offload pattern. Accurate, no issues.


Verdict

No CRITICAL or HIGH issues. The two MEDIUM findings trade one failure mode for another rather than eliminating the risk, but both are low-probability in practice. The PR is safe to merge; the MEDIUM items are reasonable follow-up candidates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude Created/proposed by Claude Code reliability Reliability improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(open-webui): ZDR pipe builds model id from rsplit(display name) not canonical model_id (proxy bug #421 never propagated)

1 participant